A biomedical event extraction method based on fine-grained and attention mechanism

Background Biomedical event extraction is a fundamental task in biomedical text mining, which provides inspiration for medicine research and disease prevention. Biomedical events include simple events and complex events. Existing biomedical event extraction methods usually deal with simple events and complex events uniformly, and the performance of complex event extraction is relatively low. Results In this paper, we propose a fine-grained Bidirectional Long Short Term Memory method for biomedical event extraction, which designs different argument detection models for simple and complex events respectively. In addition, multi-level attention is designed to improve the performance of complex event extraction, and sentence embeddings are integrated to obtain sentence level information which can resolve the ambiguities for some types of events. Our method achieves state-of-the-art performance on the commonly used dataset Multi-Level Event Extraction. Conclusions The sentence embeddings enrich the global sentence-level information. The fine-grained argument detection model improves the performance of complex biomedical event extraction. Furthermore, the multi-level attention mechanism enhances the interactions among relevant arguments. The experimental results demonstrate the effectiveness of the proposed method for biomedical event extraction.

According to BioNLP [4], a biomedical event consists of an event trigger and a set of arguments. An event trigger is usually a verb or gerund phrase that describes the occurrence of a biomedical event. Each event trigger has a specific type, which represents the event type. Arguments denote the participants of biomedical events and are generally represented as relation pairs of event triggers and entities, or of triggers and other events. Biomedical event extraction therefore aims to identify the event triggers and detect their arguments in the biomedical literature, and then to construct complete biomedical events. Biomedical events can be divided into simple events and complex events. A simple event usually includes one trigger and one argument. A complex event has multiple arguments and may be nested, that is, its arguments may themselves be events. Because of this structural complexity, the performance of complex event extraction is relatively low. Figure 1 gives an example provided by BioNLP-ST 2013. In the sentence "Bmi-1 overexpression is sufficient to promote tumorigenesis" of Fig. 1, there is a simple event of type Gene expression with the trigger word "over-expression" and a Theme argument "Bmi-1", which is an entity. In addition, there is a complex event of type Positive regulation, nested with another event, with the trigger word "promote". This event has a Theme argument "tumorigenesis" and a Cause argument linked to the Gene expression event "over-expression".
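To make the event structure concrete, the example of Fig. 1 can be written down as a small data structure. The following is only an illustrative sketch: the class names `Span`, `Argument` and `Event` are hypothetical and are not taken from any BioNLP tool, and "tumorigenesis" is kept as a plain span because the text does not state how it is annotated.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Span:
    """A text span that can fill an argument slot (e.g. an entity mention)."""
    text: str


@dataclass
class Argument:
    role: str                     # argument type, e.g. "Theme" or "Cause"
    target: Union[Span, "Event"]  # a plain span, or another event when nested


@dataclass
class Event:
    type: str
    trigger: str
    arguments: List[Argument] = field(default_factory=list)

    def is_nested(self) -> bool:
        """Return True if at least one argument is itself an event."""
        return any(isinstance(arg.target, Event) for arg in self.arguments)


# Simple event: one trigger and one Theme argument.
gene_expression = Event(
    type="Gene_expression",
    trigger="over-expression",
    arguments=[Argument("Theme", Span("Bmi-1"))],
)

# Complex event: its Cause argument is the simple event above.
positive_regulation = Event(
    type="Positive_regulation",
    trigger="promote",
    arguments=[
        Argument("Theme", Span("tumorigenesis")),
        Argument("Cause", gene_expression),
    ],
)
```

In this representation, the (trigger, trigger) relation pair of a nested event appears as an `Argument` whose target is another `Event`.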
Many advanced methods have been proposed for biomedical event extraction. Previous work can be divided into three categories: rule-based methods, traditional machine learning approaches and deep learning models. The rule-based methods [8,9] focus on formulating extraction rules and generating pre-defined dictionaries, which is time-consuming and makes it difficult to cover all event types. Machine learning methods are currently the common approaches for biomedical event extraction. For the MLEE dataset, Pyysalo et al. [10] utilized an SVM classifier for biomedical event extraction, integrating context and dependency features. Zhou et al. [11] proposed a semi-supervised learning model that extracts biomedical events using an un-annotated corpus and hidden topics. In addition, some researchers pay more attention to biomedical event trigger identification, which is a sub-task of biomedical event extraction. Zhou et al. [12] obtained biomedical domain knowledge and embedded it into word features, then combined the embedded features and context features for trigger identification. Our previous work [13] proposed a two-stage biomedical event trigger detection method, which employed SVM and the PA algorithm for classification, integrating rich manual features and feature selection. For biomedical event extraction, pipeline-based systems such as TEES [14,15] and EventMine [16] are popular, and their feasibility has been verified on many datasets. The aforementioned methods rely on handcrafted features and tailor different features to each specific task, which may require excessive experiments.
In recent years, various neural networks have been applied successfully to the biomedical event extraction task. Wang et al. [17] proposed a CNN architecture for biomedical event extraction. They integrated multiple distributed representations, such as trigger types, POS labels and topic representations. Li et al. [18] employed a GRU neural network to extract biotope and bacteria events, focusing on detecting the relationship between the two mandatory arguments, the bacterium and the location. They integrated an attention mechanism to enhance the important information and employed a domain-oriented word representation. Yan et al. [19] built a bottom-up detection framework based on LSTM to identify biotope and bacteria events. They trained the context embedding model (VecEntNet) using the annotations of arguments. The context embeddings were further adopted to train the event detection model (VecComNet) for detecting event type and direction. However, the deep learning model adopted in VecComNet is limited by the number of training samples. Abdulkadhar et al. [20] presented a hybrid approach that integrates an ensemble-learning framework by combining a Multiscale Laplacian Graph kernel and a feature-based linear kernel, using a pattern-matching engine to identify biotope and bacteria events. In addition, for biomedical event trigger detection, Nie et al. [21] proposed a word-embedding-assisted neural network prediction model. Wang et al. [22] employed a CNN to exploit higher-level features automatically, with N-words and entity mention features around candidate triggers. Rahul et al. [23] utilized bidirectional LSTM and GRU models, respectively, to identify triggers, extracting higher-level features across the sentence. Our previous work [24] proposed a Bi-LSTM model integrating an attention mechanism and a sentence vector for biomedical event trigger detection.
Chen [25] proposed a generalized cross-domain neural network transfer learning architecture and approach, which can share as much knowledge as possible between the source and target domains. More neural networks have focused on the sub-tasks of event extraction, such as event trigger identification [21-24] and relation classification [26-30]. Most of these deep models achieve superior performance compared to the traditional shallow methods.
It is worth mentioning that there are mainly two public datasets for the biomedical event extraction task: the MLEE corpus and the BioNLP series corpora. Data sparsity is a serious problem in the BioNLP corpora. For example, in the BioNLP 11 dataset, the negative instances of trigger words account for 95% of the training set. Liu et al. [31] pointed out that data sparsity is an important factor affecting the performance of event extraction. In addition, when a deep neural network model is used for classification, the context needs to be introduced to obtain the semantic information of the current word. When data sparsity is serious, a large amount of irrelevant noise is introduced, which may degrade the performance of the neural network. In contrast, statistical machine learning-based methods do not need to learn contextual semantic information and their features are relatively accurate, so the corpus distribution has little impact on their performance. Therefore, most neural network-based biomedical event extraction approaches (including the proposed method) employ the MLEE corpus as the benchmark dataset, such as [17, 21-25], and some statistical machine learning-based methods also employ the MLEE corpus, such as [10-13].
Although the above approaches have notable advantages, certain challenges remain: (1) The argument structures of simple events and complex events are different. In simple events, the arguments are only relation pairs of (trigger, entity), whereas arguments in complex events may also be relation pairs of (trigger, trigger). Nevertheless, existing biomedical event extraction methods usually deal with simple and complex events uniformly, and the performance of complex event extraction is low. (2) The interaction among arguments, which can improve the performance of complex event extraction, is not considered. (3) Sentence level information, which is helpful for detecting some ambiguous event types, is rarely exploited.
In light of these challenges, we propose a fine-grained biomedical event extraction method integrating sentence embeddings and a multi-level attention mechanism. The main contributions are summarized as follows: (1) To improve the performance of complex biomedical event extraction, we design a fine-grained model which deals with simple and complex events separately. (2) We propose a multi-level attention to enhance the interactions among the relevant arguments, which can further improve the performance of complex event extraction. (3) Sentence embeddings are integrated to exploit global sentence information, which is beneficial for detecting some ambiguous event types.

Corpus and evaluation
The commonly used MLEE dataset [10] is employed in our experiments. The MLEE corpus covers levels of biological organization from the molecular level to the whole organism. Table 1 shows the statistics of the MLEE dataset: there are 262 documents, 2608 sentences and 6677 events in total. The biomedical event types are divided into four categories, Anatomical, Molecular, General and Planned, which are further divided into 19 sub-classes. As shown in Fig. 2, the four types of complex biomedical events (Regulation, Positive_regulation, Negative_regulation, Binding) occupy a large proportion of the corpus. Therefore, complex biomedical event extraction is important for improving the overall performance of biomedical event extraction.
We employ P(recision), R(ecall) and F(-score) as the evaluation criteria. They are defined in (1), where TP, FP and FN are short for True Positives, False Positives and False Negatives, respectively.

$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F = \frac{2 \times P \times R}{P + R} \tag{1}$$
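As a small illustration of the metric, the sketch below computes P, R and F from raw counts using the standard definitions. The function name `precision_recall_f` and the zero-division convention (returning 0.0) are choices made here for illustration, not details taken from the paper.

```python
from typing import Tuple


def precision_recall_f(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall and F-score from TP, FP and FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

For example, 8 true positives with 2 false positives and 2 false negatives give P = R = F = 0.8.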

Hyper-parameters
We combine the training and validation sets for training, use the validation set for tuning parameters, and select the averaged parameters. The size of the word embeddings and sentence embeddings is 200. The number of Bi-LSTM layers is 2, and the batch size is set to 64. The dropout rate is set to 0.5 to avoid overfitting. The number of hidden nodes is set to 200, and the number of iterations is set to 100. We employ Adadelta as the stochastic gradient descent optimizer. The learning rate is set to 0.001, selected from the set {0.01, 0.001, 0.0001}.
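For reference, the settings above can be collected in a single configuration object. This is only a convenience sketch; the class name `HyperParams` and its field names are hypothetical, while the values are those reported in the text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class HyperParams:
    embedding_dim: int = 200        # word and sentence embedding size
    bilstm_layers: int = 2          # number of Bi-LSTM layers
    batch_size: int = 64
    dropout: float = 0.5
    hidden_units: int = 200
    iterations: int = 100
    optimizer: str = "adadelta"
    learning_rate: float = 0.001    # chosen from learning_rate_grid
    learning_rate_grid: Tuple[float, ...] = (0.01, 0.001, 0.0001)


config = HyperParams()
```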

The effectiveness of sentence embeddings
To verify the effectiveness of the sentence embeddings, which are introduced to enrich the global sentence information, we design a baseline model for comparison (Table 2, line 1), which is based on the Bi-LSTM with dependency-based word embeddings. We compare several ways of building the sentence embeddings: the average and the sum of the pre-trained word embeddings only, of the fine-tuned word embeddings only, of the difference between the pre-trained and fine-tuned word embeddings, and of their summation. Averaging the difference between the pre-trained word embeddings and the fine-tuned word embeddings obtains the best performance. As shown in Table 2 (line 2), the F-score increases to 77.96%, a significant improvement of 3.75%. This indicates the benefit of sentence embeddings for biomedical event trigger identification.
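The best-performing variant, averaging the difference between the two kinds of word embeddings over the sentence, can be sketched as follows. The direction of the difference (fine-tuned minus pre-trained) is an assumption made here, since the text does not fix the sign, and the function name `sentence_embedding` is hypothetical.

```python
from typing import List

Vector = List[float]


def sentence_embedding(pretrained: List[Vector], fine_tuned: List[Vector]) -> Vector:
    """Average, over the n words of a sentence, of (fine-tuned - pre-trained).

    Both inputs hold one embedding per word, in the same order.
    The sign of the difference is an assumption of this sketch.
    """
    n = len(pretrained)
    dim = len(pretrained[0])
    return [
        sum(fine_tuned[t][k] - pretrained[t][k] for t in range(n)) / n
        for k in range(dim)
    ]
```

Under this reading, the sentence vector is all zeros at the start of training, when the fine-tuned embeddings still equal the pre-trained ones, and it moves away from zero as fine-tuning proceeds.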

The effectiveness of word level attention
Word level attention can filter out irrelevant noise and enhance the important words in the sentence. As shown in Table 2 (line 3), after integrating word level attention into the baseline model, the F-score reaches 78.40%. Furthermore, when we integrate both sentence embeddings and word level attention, the model obtains the best performance, achieving an F-score of 79.96%. This indicates that word level attention contributes to the task.

The effectiveness of multi level attention
To verify the effectiveness of multi level attention, we build three different models as shown in Table 3: Bi-LSTM + Word level attention (line 2), Bi-LSTM + Sentence level attention (line 3), and Bi-LSTM + Multi level attention (line 4). As shown in Table 3, the F-scores of biomedical event extraction with word level attention and with sentence level attention are both higher than that of the baseline Bi-LSTM model (line 1). When multi level attention is integrated, the performance of biomedical event extraction is the best, achieving an F-score of 59.61%. This indicates the effectiveness of multi level attention.
To further verify the effectiveness of multi level attention for complex biomedical event extraction, Table 4 lists the F-scores of the 19 biomedical event subclasses with word level attention and with multi level attention. After adding multi level attention, the F-scores of complex biomedical events improve significantly over those obtained with word level attention only. Among the 15 simple event types, 6 types obtain higher F-scores with multi level attention than with word level attention only, and 6 types obtain the same or almost the same F-scores. Only for the Transcription and Phosphorylation types does the word level attention model perform better, and these two types account for only 0.56% and 0.51% of the total number of events. As Table 4 shows, the performance of simple event extraction is not significantly improved by multi level attention. A likely reason is that simple events are composed of one trigger word and one argument, while complex events contain multiple arguments. The sentence level attention mechanism enhances the interaction among multiple relevant arguments with the same trigger word, so its impact on argument detection for simple events is limited.
In conclusion, multi level attention can improve the performance of most types of biomedical event extraction, especially complex biomedical event extraction.

The effectiveness of fine-grained argument detection
Based on the difference in argument structure between simple and complex biomedical events, we propose the fine-grained argument detection method. As shown in Table 3 (line 5), the F-score improves by 0.33 percentage points to 59.94%, and the precision and recall also improve. To verify the significance of the fine-grained argument detection model, we conduct a t-test on the results of 10 experiments and obtain p < 0.05, which means that the improvement from fine-grained detection is significant. This indicates that fine-grained argument detection is beneficial for biomedical event extraction.

Comparisons with other methods
In this section, we list and compare the experimental results of biomedical trigger identification and event extraction with other advanced methods on the commonly used dataset MLEE.

Performance comparisons of trigger identification with other methods
As mentioned in the Related Work, there are some advanced approaches to detect event triggers. They are listed as follows.
SVM1: an SVM-based model proposed by Pyysalo et al. [10], which extracted rich hand-crafted features.
SVM2: a semi-supervised SVM-based framework integrating hidden topics and hand-crafted features, proposed by Zhou et al. [11].
EANNP: a neural network prediction model proposed by Nie et al. [21], which introduced word embeddings.
CNN: a CNN-based classifier integrating multiple distributed representations, proposed by Wang et al. [17].
GRU: a GRU neural network built by Rahul et al. [23], which introduced word and entity type embeddings.
LSTM: an LSTM-based model integrating dependency-based word embeddings and word level attention, proposed in our previous work [24].
LSTM + CRF: an LSTM + CRF model proposed by Chen [25], which integrated a transfer learning architecture for trigger recognition.
Two-stage Method: a two-stage model proposed in our previous work [13], which is based on traditional machine learning methods.
Table 5 shows the comparison results of the methods above, from which we find the following. (1) EANNP, CNN, LSTM, GRU, LSTM + CRF and our proposed method perform better than the SVM classifiers in terms of average F-score. This reveals the effectiveness of deep learning methods, which can obtain high-level semantic representations without hand-crafted features. (2) The LSTM and GRU models achieve better performance than the CNN model, which suggests that sequential models are more suitable for biomedical event extraction: biomedical literature usually contains many long texts, and recurrent neural networks (LSTM and GRU) can capture global contextual information. (3) Our proposed model outperforms the state-of-the-art two-stage method [13]. Our previous two-stage method [13] is based on an SVM classifier and the PA algorithm; it divides trigger identification into trigger recognition and trigger classification stages and needs task-specific hand-crafted features for each stage. The proposed model needs only a single classification step, and the neural network skips the extraction of complex hand-designed features. The results illustrate the effectiveness of our biomedical event trigger identification method.

Performance comparisons of event extraction with other methods
Due to the complexity of biomedical event extraction, there is less research on event extraction than on trigger identification. Pyysalo et al. [10] proposed an SVM-based approach with rich hand-crafted features. It shows significant potential compared with existing systems, and we select it as the baseline method. Zhou et al. [11] proposed a semi-supervised learning model for biomedical event extraction, which integrated hidden topics embedded in the sentences for describing the distance. Wang et al. [17] employed a CNN for biomedical event extraction, which integrated multiple distributed features, including word embeddings, trigger types, POS labels and topic representations. As shown in Table 6, our proposed method achieves an F-score of 59.94%, which is 1.63% higher than the CNN method of Wang et al. [17]. The experimental results demonstrate the effectiveness of our proposed method.

Discussion
Experimental results show that the proposed biomedical event extraction method based on fine-grained and multi-level attention has good performance. The detailed analysis for the improvement is as follows:

Sentence embeddings
Sentence embeddings can build connections among different words and enrich the sentence level information. A sentence usually contains multiple events, which are related to each other. Moreover, there is usually a strong correlation between triggers and arguments, which benefits the recognition of each. The semantic information of triggers or arguments helps to resolve the ambiguities of some types. For example, in the sentence "We especially focused on the role of Crk adaptor protein in EphB mediated signaling.", the correct type of the event triggered by "mediated" is Positive_regulation. However, it might easily be misidentified as a Regulation trigger, because in the training set it also sometimes appears as the trigger of a Regulation event.
In this case, the global sentence level features are important. The word "role", which always occurs in Positive_regulation events, and the word "signaling", which serves as an argument of "mediated", both help to classify "mediated" correctly. Therefore, we construct the sentence embeddings to enrich global sentence information. The experimental results show that the sentence embeddings improve the performance of biomedical event detection significantly.

Fine-grained argument detection
In simple events, there is only one argument, consisting of a trigger word and an entity. In complex events, there are multiple arguments, each consisting of a trigger word and an entity, or of a trigger word and another trigger word (nested events). Based on the different argument structures of simple and complex events, we propose a fine-grained argument detection model. Firstly, we construct different argument candidates for simple and complex events. Then, arguments of the same type in simple and complex events are labeled, trained and classified separately. In this way, the additional trigger-trigger relationship in nested events is less likely to be lost. For example, in the sentence of Fig. 1, besides the arguments (over-expression, Bmi-1) and (promote, tumorigenesis), the argument relationship (promote, over-expression) is easier to identify with fine-grained argument detection. Therefore, the performance of complex biomedical event argument detection is improved. The experimental results verify the effectiveness of the fine-grained argument detection model.

Multi-level attention
Word level attention focuses on important words within one sentence, and sentence level attention enhances the interaction among sentences. In this work, we define the arguments with the same trigger as relevant arguments, and integrate multi-level attention to enhance the interaction among the relevant arguments, so that relevant arguments help to identify each other. Taking the sentence in Fig. 4 as an example, the type of the argument relationship (binding, TRAF2) is Theme. Considering the influence of relevant arguments, it is easier to correctly judge the type of the argument relationship (binding, CD40) as Theme. As shown in Table 4, the multi-level attention mechanism improves the performance of complex biomedical event extraction significantly, which demonstrates the effectiveness of multi-level attention.

Methods
In this paper, we propose a fine-grained biomedical event extraction method based on sentence embeddings and a multi-level attention mechanism. Figure 3 illustrates the structure of our model, which mainly contains five parts: (1) Data representation, which combines dependency-based word embeddings and sentence embeddings as the input representation. (2) Bi-LSTM integrating a reading gate, which is the base neural network for trigger identification and argument detection. (3) Trigger identification, which classifies each event trigger candidate into a concrete event type, integrating word level attention. (4) Argument detection, which classifies each event argument candidate into a specific event argument type based on fine-grained detection and multi-level attention. (5) Post-processing, which generates the complete biomedical events.

Dependency-based word embeddings
Different from other NLP tasks, biomedical event extraction needs more information in dependency contexts than in linear contexts [32]. Therefore, we employ Word2vecf [33] to train dependency-based word embeddings as feature representation, which can capture more functional and less topical similarity, yielding more focused embeddings.
In this work, we download about 6 GB of PubMed abstracts (from 2013 to 2019) and parse them with the GDep parser, a dependency parsing tool specialized for biomedical texts. Then, we derive word contexts from the syntactic relations and generate dependency-based word embeddings with Word2vecf.

Sentence embeddings
The global information of the sentence is critical to biomedical event extraction. Previous work [24] has demonstrated the effectiveness of sentence embeddings for biomedical event extraction. Following a similar approach, we employ two kinds of word embeddings throughout the training process. As shown in (2), $x_t$ is the pre-trained dependency-based word embedding, which captures latent feature information from a large-scale unlabeled corpus, and $x_t'$ is the fine-tuned word embedding, which contains rich information associated with the biomedical events. The initial value of $x_t'$ is the same as that of the pre-trained word embedding $x_t$, and it is then updated during neural network training. The sentence vector $d_0$ is the average, over all words in the sentence, of the difference between the two embeddings; n is the length of the sentence, t refers to the current time, and T denotes the total training time. To control what information should be

Bi-LSTM integrating reading gate
A Bi-LSTM consists of a forward LSTM and a backward LSTM, so that the context representation is learned from both directions. As shown in (3), the forward pass output ($h^f_t$) and the backward pass output ($h^b_t$) are combined by summation. Our new Bi-LSTM architecture, which leverages both the dependency-based word embeddings $x_t$ and the fine-tuned word embeddings $x_t'$, is described in (4) to (7). A standard LSTM architecture mainly consists of three units: the input, output and forget gates. As shown in (8), a reading gate is added to control the sentence embeddings. (9) describes the sentence information at time t. The cell value $c_t$ is modified to (10) after integrating the sentence embeddings.
where x is the input embedding at time t; i, f, o and c are the input gate, forget gate, output gate and proposed cell values, respectively; $w_{xh}$ denotes the input connections, $w_{hh}$ the recurrent connections, and $b_h$ the bias value. σ represents the logistic sigmoid function, ⊙ denotes element-wise multiplication, and $c_t$ is the true cell value at time t.
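To illustrate how a reading gate can feed sentence information into the cell state, the sketch below implements one time step of a single-direction LSTM in plain Python. It is a simplified reading of the description above, not the exact formulation of (4) to (10): it assumes that the reading gate r scales the previous sentence vector element-wise and that the gated sentence vector is added to the cell value through a tanh projection, and it uses a single input x rather than both kinds of embeddings. All parameter names (`W_xi`, `W_hr`, `W_dc`, and so on) are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def add(*vs: Vector) -> Vector:
    return [sum(xs) for xs in zip(*vs)]


def mul(a: Vector, b: Vector) -> Vector:
    """Element-wise product."""
    return [x * y for x, y in zip(a, b)]


def sigmoid(v: Vector) -> Vector:
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def tanh(v: Vector) -> Vector:
    return [math.tanh(x) for x in v]


def lstm_step_with_reading_gate(
    x: Vector, h_prev: Vector, c_prev: Vector, d_prev: Vector, p: Dict[str, list]
) -> Tuple[Vector, Vector, Vector]:
    """One LSTM step; returns the new hidden state, cell value and sentence vector."""

    def gate(name: str, act) -> Vector:
        return act(add(matvec(p["W_x" + name], x),
                       matvec(p["W_h" + name], h_prev),
                       p["b_" + name]))

    i = gate("i", sigmoid)      # input gate
    f = gate("f", sigmoid)      # forget gate
    o = gate("o", sigmoid)      # output gate
    c_hat = gate("c", tanh)     # proposed cell values
    r = gate("r", sigmoid)      # reading gate (assumed form)

    d = mul(r, d_prev)          # sentence information kept at this step
    # Cell value with the gated sentence vector added (assumed form).
    c = add(mul(f, c_prev), mul(i, c_hat), tanh(matvec(p["W_dc"], d)))
    h = mul(o, tanh(c))
    return h, c, d
```

A Bi-LSTM would run this step once left to right and once right to left, and then sum the two hidden states at each position.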

Trigger identification
Trigger identification aims to assign each token or phrase to a specific event trigger type, or to a negative class if it does not belong to any trigger class. It is usually treated as a multi-class classification problem. In this paper, we mark each candidate trigger in a given sentence with the BIO labeling method [34]. Then we build a Bi-LSTM trigger identification model and integrate word level attention to enhance the important word information in the sentence.
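As an illustration of BIO labeling for triggers, the sketch below tags the example sentence of Fig. 1. The helper `bio_tags` and its span format (start index, exclusive end index, event type) are hypothetical conventions chosen for this example.

```python
from typing import List, Tuple


def bio_tags(n_tokens: int, triggers: List[Tuple[int, int, str]]) -> List[str]:
    """Return one BIO tag per token.

    Each trigger is (start, end, event_type), with `end` exclusive.
    Tokens outside every trigger get the negative tag "O".
    """
    tags = ["O"] * n_tokens
    for start, end, event_type in triggers:
        tags[start] = "B-" + event_type
        for i in range(start + 1, end):
            tags[i] = "I-" + event_type
    return tags


tokens = "Bmi-1 overexpression is sufficient to promote tumorigenesis".split()
triggers = [(1, 2, "Gene_expression"), (5, 6, "Positive_regulation")]
tags = bio_tags(len(tokens), triggers)
```

Here "overexpression" and "promote" receive B- tags for their event types, and every other token is tagged "O".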

Word level attention
According to the analysis of corpus, different words in a sentence usually have different influence in the overall semantic information. Therefore, we integrate word level attention to filter out the irrelevant noise information and enhance the important words. Firstly, we initialize a random weight matrix tuned with the training process. Then, the weight vector could learn word features automatically and record the significant information by increasing the corresponding weights.
As shown in (11), we apply the activation function tanh to the final state $H$ ($H \in \mathbb{R}^{d_w \times L}$), where $L$ is the sentence length and $d_w$ denotes the word embedding dimension.
In (12), the attention mechanism produces a vector $\alpha$ of attention weights, where $w$ is a trained parameter vector and $w^T$ is its transpose. Then, in (13), a weighted representation $\gamma$ is formed as a weighted sum of the output vectors $H$. Finally, the overall semantic information of the sentence is obtained from (14), where $h^*_i$ represents the final sentence representation. The dimensions of $\alpha$, $w$, $\gamma$ and $h^*$ are $L$, $d_w$, $d_w$ and $d_w$, respectively.
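The four steps described above can be sketched in plain Python as follows. The sketch assumes a common attention formulation that matches the stated dimensions: scores are computed as $w^T \tanh(H)$, normalized with softmax to give $\alpha$, used to form the weighted sum $\gamma$, and passed through tanh to give $h^*$. The exact form of (11) to (14) may differ, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_level_attention(H: List[Vector], w: Vector) -> Tuple[Vector, Vector]:
    """H holds L hidden vectors of size d_w (one per word); w has size d_w.

    Returns the attention weights alpha (length L) and h_star (length d_w).
    """
    # One score per word: w^T tanh(h_j)
    scores = [sum(wk * math.tanh(hk) for wk, hk in zip(w, h)) for h in H]
    alpha = softmax(scores)
    d_w = len(w)
    # gamma: attention-weighted sum of the hidden vectors
    gamma = [sum(a * h[k] for a, h in zip(alpha, H)) for k in range(d_w)]
    h_star = [math.tanh(g) for g in gamma]
    return alpha, h_star
```

With a zero parameter vector, all words receive the same weight; as training changes `w`, words whose hidden vectors align with it receive larger weights.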

Trigger classification
In this work, we treat each token of a sentence as a trigger candidate instance. The Bi-LSTM model with the attention mechanism generates the hidden output $h^*_i$ of each word. Then, we utilize the softmax function as the classifier to predict the label $\hat{y}$ of each trigger candidate. The classifier takes the hidden output $h^*_i$ as input. In our model, the objective function is the cross-entropy loss defined in (16), where $t_i^j$ denotes the j-th type distribution of the i-th instance and $p_i^j$ is the predicted distribution.
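A minimal sketch of the classification step and its loss is given below. It assumes a linear layer followed by softmax, and a cross-entropy loss summed over instances and types; the names `predict` and `cross_entropy`, the weight matrix `W` and the bias `b` are illustrative and not taken from the paper.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def softmax(scores: Vector) -> Vector:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(h_star: Vector, W: Matrix, b: Vector) -> Vector:
    """Predicted type distribution for one candidate: softmax(W h* + b)."""
    logits = [sum(wk * hk for wk, hk in zip(row, h_star)) + bj
              for row, bj in zip(W, b)]
    return softmax(logits)


def cross_entropy(targets: List[Vector], predictions: List[Vector]) -> float:
    """Sum over instances i and types j of -t_ij * log(p_ij)."""
    return -sum(
        t * math.log(p)
        for t_row, p_row in zip(targets, predictions)
        for t, p in zip(t_row, p_row)
        if t > 0.0
    )
```

The predicted label of a candidate is the type with the highest probability in the vector returned by `predict`.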

Argument detection
Argument detection belongs to complex relation classification. In the simple events, argument detection aims to find the relation between predicted trigger and entities in sentence. In the complex events, it aims to find the relation between predicted trigger and entities or other triggers (nested event). Then, if the relation exists, the relation types should be given.

Fine-grained argument detection
Considering the differences of argument structure between simple biomedical events and complex events, we propose the fine-grained argument detection method to further improve the performance of complex biomedical event extraction.
(1) We construct different argument candidates for simple and complex events. For simple events, we take the sentence fragments composed of the predicted trigger, the entity and the words between them as argument candidate instances. For complex events, the argument candidate instances are composed of the predicted trigger, the entity or other trigger, and the words between them. (2) We make a fine-grained distinction between arguments of the same type in simple and complex events. For example, we label the Theme type arguments in simple events as "Theme" and the same type of arguments in complex events as "CTheme", then train and classify them separately. (3) From the analysis of argument structure in complex events, we find that the argument relation pairs in a complex event share the same trigger, and these arguments usually interact strongly. For example, as shown in Fig. 4, the argument relations (binding, TRAF3) and (binding, CD40) have the same trigger "binding", belong to the same type Theme, and are in the same complex event. In addition, the arguments with the same trigger in simple events also have common features. Therefore, we define these arguments as relevant arguments, and employ multi level attention to enhance their interaction.
Relevant arguments: arguments that contain the same trigger word in biomedical events.
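The three steps above can be sketched as follows for the example sentence of Fig. 1. This is a simplified illustration: triggers and entities are identified by single token positions, "tumorigenesis" is treated as an entity span only for the sake of the example, and the function names are hypothetical. The set of complex event types follows the four types named in the corpus description, and the "C" prefix follows the "CTheme" example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

COMPLEX_TYPES = {"Regulation", "Positive_regulation", "Negative_regulation", "Binding"}


def build_candidates(tokens: List[str],
                     triggers: List[Tuple[int, str]],
                     entity_positions: List[int]) -> List[dict]:
    """Pair each predicted trigger with its possible argument targets."""
    candidates = []
    for t_idx, t_type in triggers:
        is_complex = t_type in COMPLEX_TYPES
        targets = list(entity_positions)
        if is_complex:
            # Complex events may also take other triggers (nested events).
            targets += [i for i, _ in triggers if i != t_idx]
        for target in targets:
            lo, hi = sorted((t_idx, target))
            candidates.append({
                "trigger": t_idx,
                "target": target,
                "complex": is_complex,
                "fragment": tokens[lo:hi + 1],  # trigger, target, words between
            })
    return candidates


def fine_grained_label(role: str, is_complex: bool) -> str:
    """Keep the role for simple events; prefix it with 'C' for complex events."""
    return "C" + role if is_complex else role


def group_relevant(candidates: List[dict]) -> Dict[int, List[dict]]:
    """Relevant arguments: candidates that share the same trigger."""
    groups: Dict[int, List[dict]] = defaultdict(list)
    for cand in candidates:
        groups[cand["trigger"]].append(cand)
    return groups


tokens = "Bmi-1 overexpression is sufficient to promote tumorigenesis".split()
triggers = [(1, "Gene_expression"), (5, "Positive_regulation")]
candidates = build_candidates(tokens, triggers, entity_positions=[0, 6])
groups = group_relevant(candidates)
```

Only the complex trigger "promote" yields the trigger-trigger candidate (promote, overexpression), and the three candidates that share this trigger form one group of relevant arguments.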

Multi level attention
Word level attention can obtain the key semantic information within a given sentence. Sentence level attention introduces global semantic information and enhances the interaction among relevant arguments. To combine these advantages, we propose a multi level (word level and sentence level) attention for argument detection.
In this work, the relevant argument instances are represented as the vector matrix $H^* = \{h^*_1, h^*_2, \cdots, h^*_M\}$, where $h^*_i$ is the hidden output of the word level attention layer and $M$ is the number of relevant instances within the same batch. As shown in (18), after reducing the dimension of $h^*_i$, a new vector matrix $H^{*S} = \{h^{*S}_1, h^{*S}_2, \cdots, h^{*S}_M\}$ representing the sentence features is generated. As shown in (19)-(22), the hidden output weighted by the sentence attention is obtained, and it is sent to the softmax function for argument prediction.

Argument prediction
To improve the performance of argument detection, the same argument types in simple and complex biomedical events are divided into more fine-grained categories, which are labeled and classified separately. After the Bi-LSTM and multi level attention layers, the hidden output of Eq. (22) is sent to the softmax function to obtain the argument candidate type, as shown in (23) and (24),
where W is the learned weight matrix, b is the bias value, and C denotes the set of argument types. The objective function is the cross-entropy loss function.

Post-processing
Pipeline biomedical event extraction methods include three sub-processes: trigger identification, argument detection, and post-processing. Post-processing removes invalid event candidates and ensures that the final events are correct [35]. In this paper, we utilize an SVM classifier based on TEES [15] to learn the legal event structures automatically from the extracted features, and then construct the correct event candidates. The features extracted in this process fall into three main categories [36]: linear span features, such as bag-of-words between arguments; argument combination features, such as argument role features and count features; and argument content features, such as entity features and argument edge features.